
ipc: use shared memory for large events #972

Draft
matthew-levan wants to merge 46 commits into ml/64 from ml/shm

Conversation

@matthew-levan matthew-levan commented Feb 27, 2026

Shared Memory Plea Protocol

Large pokes (file system commits, etc.) sent from urth to mars previously
travelled over the Unix pipe using the standard newt/jam/cue path. For
payloads above ~256 MiB this caused severe memory pressure and, for payloads
approaching 2 GiB, a process segfault.

This PR describes the replacement: a POSIX shared-memory fast path that
bypasses the pipe for large events, copies the raw loom noun structure directly
(no jam/cue), and keeps peak memory within the capacity of a 16 GiB machine.


Problem

The standard path for sending an event from urth to mars is:

  1. Urth jams the noun → ~2 GiB C-heap buffer (dat_y)
  2. Urth writes dat_y over the pipe (5-byte-header newt framing)
  3. Mars reads the pipe, cues the bytes back into a loom noun

For a 2 GiB file commit this requires:

  • Urth: ~2 GiB C-heap for the jammed bytes
  • Mars: ~2 GiB C-heap for the cue dictionary (transient) + ~2 GiB loom for
    the decoded noun + ~2 GiB loom for the re-jammed LMDB event = ~6 GiB peak
  • The pipe itself becomes a congestion point at multi-GiB sizes

Design

Protocol

A new %plea message type is added to the urth↔mars IPC protocol (alongside the
existing %poke, %peek, %live, etc.):

urth → mars   [%plea len=@ud]          request: allocate shm of len bytes
mars → urth   [%plea nam=@t len=@ud]   response: shm name + confirmed length
urth → mars   [%done ~]                urth has filled shm; proceed

The normal %poke response from mars back to urth is unchanged; the plea writ
is converted to a poke writ in-place before %done is sent so that mars's
eventual [%poke ...] reply matches through the standard writ-queue path.
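The handshake ordering can be modeled as a small state machine. The type and
function names below are purely illustrative, not the actual vere types:

```c
#include <assert.h>
#include <stdint.h>

/* hypothetical model of the three-step handshake:
   urth requests len bytes, mars grants a name + confirmed length,
   urth fills the region and sends %done */
typedef enum { PLEA_IDLE, PLEA_WAIT, PLEA_FILL, PLEA_DONE } plea_st;

typedef struct {
  plea_st     st;
  uint64_t    len;
  const char* nam;
} plea;

/* urth → mars [%plea len=@ud] */
static int plea_request(plea* p, uint64_t len) {
  if (p->st != PLEA_IDLE) return -1;
  p->len = len; p->st = PLEA_WAIT; return 0;
}

/* mars → urth [%plea nam=@t len=@ud]: length must match the request */
static int plea_grant(plea* p, const char* nam, uint64_t len) {
  if (p->st != PLEA_WAIT || len != p->len) return -1;
  p->nam = nam; p->st = PLEA_FILL; return 0;
}

/* urth → mars [%done ~]: only valid after the grant */
static int plea_done(plea* p) {
  if (p->st != PLEA_FILL) return -1;
  p->st = PLEA_DONE; return 0;
}
```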

Threshold

The plea path is taken when the serialized noun size exceeds _UNIX_PLEA_THRESHOLD
(256 MiB), currently triggered from _unix_update_mount in pkg/vere/io/unix.c.

Shared memory ownership

  • Mars creates the shm object (shm_open + ftruncate), sends the name to
    urth, then waits for %done.
  • Urth opens the same shm by name, calls the fill callback (writes the noun),
    then munmaps and sends %done. Urth never owns the shm region.
  • Mars receives %done, mmaps the shm read-only, deserializes the noun, then
    munmaps and shm_unlinks before continuing.
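The ownership split can be sketched with standard POSIX calls. The shm name and
helper names here are hypothetical (real code would generate a unique name per
plea), but the call sequence mirrors the bullets above:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define DEMO_SHM_NAME "/plea-demo"   /* hypothetical; real code generates one */

/* mars side: create the object and size it (shm_open + ftruncate) */
static int mars_create(size_t len) {
  int fd = shm_open(DEMO_SHM_NAME, O_CREAT | O_RDWR, 0600);
  if (fd < 0) return -1;
  if (ftruncate(fd, (off_t)len) < 0) { close(fd); return -1; }
  return fd;
}

/* urth side: open by name, fill, munmap; urth never unlinks the region */
static int urth_fill(size_t len, const char* msg) {
  int fd = shm_open(DEMO_SHM_NAME, O_RDWR, 0);
  if (fd < 0) return -1;
  void* buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd);
  if (buf == MAP_FAILED) return -1;
  memcpy(buf, msg, strlen(msg) + 1);
  return munmap(buf, len);
}

/* mars side: map read-only, consume, then munmap and shm_unlink */
static int mars_consume(int fd, size_t len, char* out, size_t cap) {
  void* buf = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
  close(fd);
  if (buf == MAP_FAILED) return -1;
  strncpy(out, buf, cap);
  munmap(buf, len);
  return shm_unlink(DEMO_SHM_NAME);
}
```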

Noun serialization: raw loom copy (no jam/cue)

Instead of jam/cue, the shm buffer holds a compact binary encoding of the raw
loom noun structure, implemented in pkg/noun/allocate.c:

u3a_noun_shm_size(u3_noun som) → c3_d
: DFS traversal counting bytes needed. Handles DAG sharing via a
ur_dict64_t (loom offset → sentinel). Returns total byte length including
the 16-byte header.

u3a_noun_to_shm(u3_noun som, c3_y* shm_y, c3_d cap_d) → c3_d
: Iterative post-order DFS. Writes each unique indirect object (atom or cell)
exactly once in child-before-parent order. Returns bytes written.

u3a_noun_from_shm(const c3_y* shm_y, c3_d len_d) → u3_weak
: Single-dict two-phase deserializer (see below). Returns the root noun
allocated on the current road, or u3_none on error.

SHM buffer format

offset  size  field
     0     8  root_noun   -- root noun in shm-offset space (see tags below)
     8     8  data_len    -- byte length of data section
    16  ...   data        -- allocations in DFS post-order

Noun values in shm-offset space use the top two bits as a tag:

  • 00xxxxxxx… — direct atom (fits in 62 bits); stored as-is
  • 10xxxxxxx… — indirect atom; low 62 bits = byte offset into data section
  • 11xxxxxxx… — cell; low 62 bits = byte offset into data section
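The tagging scheme can be sketched with a few helpers. Macro and function names
are hypothetical, not the ones in allocate.c; note that tag 01 is unassigned:

```c
#include <assert.h>
#include <stdint.h>

#define SHM_TAG_MASK   (3ULL << 62)
#define SHM_TAG_IATOM  (2ULL << 62)  /* 10: indirect atom, low 62 bits = offset */
#define SHM_TAG_CELL   (3ULL << 62)  /* 11: cell, low 62 bits = offset          */

/* tag a data-section byte offset as an indirect atom or a cell */
static inline uint64_t shm_iatom(uint64_t off) { return SHM_TAG_IATOM | off; }
static inline uint64_t shm_cell(uint64_t off)  { return SHM_TAG_CELL  | off; }

/* strip the tag to recover the offset */
static inline uint64_t shm_off(uint64_t val)   { return val & ~SHM_TAG_MASK; }

/* 00 prefix: a direct atom, stored as-is */
static inline int shm_is_direct(uint64_t val) {
  return (val & SHM_TAG_MASK) == 0;
}
static inline int shm_is_cell(uint64_t val) {
  return (val & SHM_TAG_MASK) == SHM_TAG_CELL;
}
```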

Each atom entry in the data section:

+0   8  len_w    -- number of 64-bit data words (> 0 distinguishes from cell)
+8   4  mug_h    -- cached mug
+12  4  (pad)
+16  len_w*8     -- atom data words, LSB-first

Each cell entry in the data section:

+0   8  tag=0    -- zero distinguishes from atom
+8   4  mug_h
+12  4  (pad)
+16  8  hed      -- head noun in shm-offset space
+24  8  tel      -- tail noun in shm-offset space
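The two entry layouts above map naturally onto C structs. These type names are
illustrative only, not the ones in allocate.c:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* atom entry: len_w > 0 distinguishes it from a cell entry */
typedef struct {
  uint64_t len_w;     /* number of 64-bit data words          */
  uint32_t mug_h;     /* cached mug                           */
  uint32_t pad;
  uint64_t dat[];     /* len_w atom data words, LSB-first     */
} shm_atom_entry;

/* cell entry: the leading tag word is always zero */
typedef struct {
  uint64_t tag;       /* 0 — distinguishes a cell from an atom */
  uint32_t mug_h;
  uint32_t pad;
  uint64_t hed;       /* head noun, in shm-offset space        */
  uint64_t tel;       /* tail noun, in shm-offset space        */
} shm_cell_entry;
```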

u3a_noun_from_shm: single-dict two-phase approach

A single ur_dict64_t serves both phases, halving peak C-heap vs a two-dict
approach:

  • Phase 1 (linear scan): for each cell entry, count how many times each
    shm offset appears as hed, tel, or root. Store dict[shm_off] = refcount.
  • Phase 2 (linear scan): for each entry, read use_d = dict[shm_off],
    allocate the loom noun with use_w = use_d, then overwrite
    dict[shm_off] = loom_noun.

This is safe because data is written in post-order: when phase 2 encounters a
cell, both children have already been processed and their dict entries already
hold the resolved loom nouns.
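A toy illustration of the two-phase idea, with a plain array standing in for
ur_dict64_t and integers standing in for loom nouns (a cell "resolves" to
hed + tel here, purely to show that children are ready before parents):

```c
#include <assert.h>
#include <stdint.h>

/* one data-section entry: a leaf value, or a cell over earlier indices */
typedef struct { int is_cell; uint64_t val; int hed, tel; } entry;

/* phase 1 (linear scan): count references to each entry */
static void phase1(const entry* e, int n, int root, uint64_t* dict) {
  for (int i = 0; i < n; i++) dict[i] = 0;
  for (int i = 0; i < n; i++)
    if (e[i].is_cell) { dict[e[i].hed]++; dict[e[i].tel]++; }
  dict[root]++;  /* the root is referenced once by the header */
}

/* phase 2 (linear scan): overwrite each refcount with the resolved value.
   Post-order guarantees a cell's children are already resolved. */
static uint64_t phase2(const entry* e, int n, int root, uint64_t* dict) {
  for (int i = 0; i < n; i++) {
    uint64_t use = dict[i];  /* refcount from phase 1; would seed use_w */
    (void)use;
    dict[i] = e[i].is_cell ? dict[e[i].hed] + dict[e[i].tel] : e[i].val;
  }
  return dict[root];
}
```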

Dict pre-sizing

Large nouns would otherwise trigger many costly resize generations. Both
u3a_noun_to_shm and u3a_noun_from_shm pre-size their ur_dict64_t via
_shm_dict_init, which picks the smallest adjacent Fibonacci pair whose initial
bucket count can hold the estimated node count (dat_d / _SHM_CELL_SIZE) without
resizing. The Fibonacci table in pkg/ur/defs.h was extended from ur_fib34
through ur_fib36 to cover the required range.
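The pre-sizing choice can be sketched as follows. The function name is
hypothetical, and the real _shm_dict_init works in terms of ur_dict64_t bucket
counts rather than returning the pair directly:

```c
#include <assert.h>
#include <stdint.h>

/* Walk the Fibonacci series until the larger element of an adjacent pair
   can hold the estimated node count; return the pair via out-params. */
static void shm_dict_presize(uint64_t nodes, uint64_t* prev, uint64_t* size) {
  uint64_t a = 1, b = 2;   /* ur_dict grows along the Fibonacci series */
  while (b < nodes) {
    uint64_t c = a + b;
    a = b;
    b = c;
  }
  *prev = a;
  *size = b;
}
```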


Files Changed

| File | Change |
| --- | --- |
| pkg/c3/motes.h | Added c3__plea mote |
| pkg/ur/defs.h | Added ur_fib29–ur_fib36 |
| pkg/vere/vere.h | Added u3_writ_plea enum value; pla_u struct in u3_writ union; u3_lord_plea() declaration |
| pkg/vere/mars.h | Added u3_mars_plea_e state; pla_u struct in u3_mars |
| pkg/vere/lord.c | _lord_plea_plea() handler; u3_lord_plea() public API; %plea dispatch in writ machinery |
| pkg/vere/mars.c | %plea and %done cases in _mars_work(); state guard in u3_mars_kick() |
| pkg/vere/io/unix.c | _unix_plea_ctx, _unix_plea_fill, plea branch in _unix_update_mount() |
| pkg/noun/allocate.h | Declarations for u3a_noun_shm_size, u3a_noun_to_shm, u3a_noun_from_shm |
| pkg/noun/allocate.c | Full implementations of the above; _shm_dict_init helper |

Test Results: 2 GiB File Commit

Tested on Apple M-series (ARM64, macOS 26.3), 16 GiB RAM, --urth-loom 34
(16 GiB virtual loom). The commit consisted of a single ~2 GiB binary file
written via the Clay Unix mount. vmmap snapshots taken immediately after
the commit completed (both processes idle, LMDB write in progress).

Process:  urbit [31135]  (urth)    Launch: 18:31:01  Sample: 18:31:53
Process:  urbit [31136]  (mars)    Launch: 18:31:01  Sample: 18:31:54

Urth (31135)

Physical footprint:         6.1 GiB  (peak: 8.0 GiB)

MALLOC_LARGE                78.5 MiB   1 region   (live — LMDB dat_y buffer)
MALLOC_LARGE (empty)         2.0 GiB  18 regions  (freed shm serialize dict)
VM_ALLOCATE                  4.1 GiB  35 regions  (urth loom dirty pages)
DefaultMallocZone           79.3 MiB  live         (incl. LMDB buffer)
  • The 2.0 GiB freed dict (18 regions) reflects the pre-sized old_to_shm
    dict from u3a_noun_to_shm. Before pre-sizing this was 33 regions and
    was still live (4.0 GiB) when sampled mid-serialize.
  • The 78.5 MiB live MALLOC_LARGE is the jammed event buffer (u3_feat::hun_y
    in disk.c) held pending async LMDB write — unavoidable given the jam-based
    event log.
  • The 4.1 GiB VM_ALLOCATE is the urth loom's dirty pages (~2 GiB noun +
    working set), all of which are MAP_ANON | MAP_PRIVATE.

Mars (31136)

Physical footprint:        12.6 GiB  (peak: 14.8 GiB)

MALLOC_LARGE (empty)        3.9 GiB  32 regions  (freed dicts — see below)
VM_ALLOCATE                 8.7 GiB  72 regions  (mars loom + shm region)
DefaultMallocZone           3.6 MiB  live         (no malloc leak)
mapped file                 1.0 TiB  virtual      (LMDB mmap, 13 MiB resident)
  • The 3.9 GiB freed MALLOC_LARGE is a mix of:
    • The shm decode dict (u3a_noun_from_shm, pre-sized, few resize generations)
    • The jam dict from u3qe_jam (starts at fib11/fib12, grows through many
      generations for a 2 GiB noun — this is the dominant contributor and is
      independent of the plea protocol)
  • Peak 14.8 GiB vs current 12.6 GiB: the ~2.2 GiB delta is the jam output
    buffer (dat_y) freed after the async LMDB write completed.
  • The 3.6 MiB live DefaultMallocZone confirms no malloc leak from the
    plea/decode path.

Peak breakdown

| Phase | Approximate cost |
| --- | --- |
| Mars loom (decoded 2 GiB noun + Arvo state) | ~6 GiB dirty loom pages |
| Shm region (owned by mars) | ~2 GiB VM_ALLOCATE |
| Shm decode dict (peak, pre-sized, freed) | ~1–2 GiB |
| Jam output for LMDB (transient, freed) | ~2 GiB |
| Urth loom dirty pages | ~4 GiB |
| Urth serialize dict (transient, freed) | ~2 GiB |
| Combined peak (urth + mars) | ~14–15 GiB |

Comparison: old pipe path vs plea protocol

| Metric | Pipe (jam/cue) | Plea (raw loom copy) |
| --- | --- | --- |
| Urth C-heap (jam buffer) | ~2 GiB, held until LMDB write completes | ~2 GiB (same — LMDB buffer) |
| Urth serialize overhead | none (jam is the buffer) | ~2 GiB transient dict |
| Mars cue dict (transient) | ~6 GiB (two ur_dict64_t) | ~1–2 GiB (one pre-sized dict) |
| Mars decoded noun on loom | ~6–8 GiB | ~6–8 GiB (same) |
| Pipe congestion | yes — 2 GiB over a Unix pipe | eliminated |
| Segfault on 2 GiB commit | yes | no |

Known Limitations / Future Work

  • Jam dict pre-sizing: Mars's residual ~2 GiB in MALLOC_LARGE (empty) is
    dominated by u3qe_jam's internal dict (used when writing the decoded noun to
    the event log). Pre-sizing that dict from the known noun size would reduce mars
    peak by ~1–2 GiB.
  • Threshold tuning: The 256 MiB threshold is conservative. A lower value
    (e.g. 64 MiB) would engage the plea path more aggressively, and the
    per-call overhead (shm creation, two IPC round-trips) is small enough to
    permit it.
  • Linux testing: All measurements above are macOS ARM64. The shm path uses
    standard POSIX interfaces (shm_open, mmap, munmap, shm_unlink) and
    should be portable, but has not yet been profiled on Linux.
